Search | VHL Regional Portal

1.

Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning.

Cappelletti, Luca; Rekerle, Lauren; Fontana, Tommaso; Hansen, Peter; Casiraghi, Elena; Ravanmehr, Vida; Mungall, Christopher J; Yang, Jeremy J; Spranger, Leonard; Karlebach, Guy; Caufield, J Harry; Carmody, Leigh; Coleman, Ben; Oprea, Tudor I; Reese, Justin; Valentini, Giorgio; Robinson, Peter N.

Bioinform Adv ; 4(1): vbae036, 2024.

Article in English | MEDLINE | ID: mdl-38577542

ABSTRACT

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then the performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
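
The contrast between uniform and degree-aware negative sampling described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is in the repository linked in the abstract); the function name `degree_aware_negatives`, the toy edge list, and the choice to draw both endpoints with probability proportional to node degree are assumptions made purely for illustration.

```python
import random
from collections import Counter


def degree_aware_negatives(edges, n_samples, seed=0):
    """Sample node pairs that are NOT known edges, drawing each endpoint
    with probability proportional to its degree in the positive graph.

    Hypothetical helper for illustration only; not the published method.
    """
    rng = random.Random(seed)
    positives = {frozenset(e) for e in edges}
    # Degree of each node, counted over the known (positive) edges.
    degree = Counter(node for edge in edges for node in edge)
    nodes = sorted(degree)
    weights = [degree[n] for n in nodes]

    negatives = set()
    attempts = 0
    max_attempts = 100 * n_samples  # guard against graphs with few non-edges
    while len(negatives) < n_samples and attempts < max_attempts:
        attempts += 1
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        # Reject self-pairs and pairs that are already positive edges.
        if u != v and pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


if __name__ == "__main__":
    toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
    print(degree_aware_negatives(toy_edges, n_samples=2))
```

Under uniform sampling every node would be equally likely to appear in a negative pair; weighting by degree instead makes high-degree nodes appear in negatives about as often as they do in positives, which is the imbalance the abstract identifies.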

2.

Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning.

Caufield, J Harry; Hegde, Harshad; Emonet, Vincent; Harris, Nomi L; Joachimiak, Marcin P; Matentzoglu, Nicolas; Kim, HyeongSik; Moxon, Sierra; Reese, Justin T; Haendel, Melissa A; Robinson, Peter N; Mungall, Christopher J.

Bioinformatics ; 40(3), 2024 Mar 04.

Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly-available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.

Subjects

Knowledge Bases , Semantics , Databases, Factual

3.

The Human Phenotype Ontology in 2024: phenotypes around the world.

Gargano, Michael A; Matentzoglu, Nicolas; Coleman, Ben; Addo-Lartey, Eunice B; Anagnostopoulos, Anna V; Anderton, Joel; Avillach, Paul; Bagley, Anita M; Bakstein, Eduard; Balhoff, James P; Baynam, Gareth; Bello, Susan M; Berk, Michael; Bertram, Holli; Bishop, Somer; Blau, Hannah; Bodenstein, David F; Botas, Pablo; Boztug, Kaan; Cady, Jolana; Callahan, Tiffany J; Cameron, Rhiannon; Carbon, Seth J; Castellanos, Francisco; Caufield, J Harry; Chan, Lauren E; Chute, Christopher G; Cruz-Rojo, Jaime; Dahan-Oliel, Noémi; Davids, Jon R; de Dieuleveult, Maud; de Souza, Vinicius; de Vries, Bert B A; de Vries, Esther; DePaulo, J Raymond; Derfalvi, Beata; Dhombres, Ferdinand; Diaz-Byrd, Claudia; Dingemans, Alexander J M; Donadille, Bruno; Duyzend, Michael; Elfeky, Reem; Essaid, Shahim; Fabrizzi, Carolina; Fico, Giovanna; Firth, Helen V; Freudenberg-Hua, Yun; Fullerton, Janice M; Gabriel, Davera L; Gilmour, Kimberly.

Nucleic Acids Res ; 52(D1): D1333-D1346, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37953324

ABSTRACT

The Human Phenotype Ontology (HPO) is a widely used resource that comprehensively organizes and defines the phenotypic features of human disease, enabling computational inference and supporting genomic and phenotypic analyses through semantic similarity and machine learning algorithms. The HPO has widespread applications in clinical diagnostics and translational research, including genomic diagnostics, gene-disease discovery, and cohort analytics. In recent years, groups around the world have developed translations of the HPO from English to other languages, and the HPO browser has been internationalized, allowing users to view HPO term labels and in many cases synonyms and definitions in ten languages in addition to English. Since our last report, a total of 2239 new HPO terms and 49235 new HPO annotations were developed, many in collaboration with external groups in the fields of psychiatry, arthrogryposis, immunology and cardiology. The Medical Action Ontology (MAxO) is a new effort to model treatments and other measures taken for clinical management. Finally, the HPO consortium is contributing to efforts to integrate the HPO and the GA4GH Phenopacket Schema into electronic health records (EHRs) with the goal of more standardized and computable integration of rare disease data in EHRs.

Subjects

Biological Ontologies , Humans , Phenotype , Genomics , Algorithms , Rare Diseases

4.

The Monarch Initiative in 2024: an analytic platform integrating phenotypes, genes and diseases across species.

Putman, Tim E; Schaper, Kevin; Matentzoglu, Nicolas; Rubinetti, Vincent P; Alquaddoomi, Faisal S; Cox, Corey; Caufield, J Harry; Elsarboukh, Glass; Gehrke, Sarah; Hegde, Harshad; Reese, Justin T; Braun, Ian; Bruskiewich, Richard M; Cappelletti, Luca; Carbon, Seth; Caron, Anita R; Chan, Lauren E; Chute, Christopher G; Cortes, Katherina G; De Souza, Vinícius; Fontana, Tommaso; Harris, Nomi L; Hartley, Emily L; Hurwitz, Eric; Jacobsen, Julius O B; Krishnamurthy, Madan; Laraway, Bryan J; McLaughlin, James A; McMurry, Julie A; Moxon, Sierra A T; Mullen, Kathleen R; O'Neil, Shawn T; Shefchek, Kent A; Stefancsik, Ray; Toro, Sabrina; Vasilevsky, Nicole A; Walls, Ramona L; Whetzel, Patricia L; Osumi-Sutherland, David; Smedley, Damian; Robinson, Peter N; Mungall, Christopher J; Haendel, Melissa A; Munoz-Torres, Monica C.

Nucleic Acids Res ; 52(D1): D938-D949, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-38000386

ABSTRACT

Bridging the gap between genetic variations, environmental determinants, and phenotypic outcomes is critical for supporting clinical diagnosis and understanding mechanisms of diseases. It requires integrating open data at a global scale. The Monarch Initiative advances these goals by developing open ontologies, semantic data models, and knowledge graphs for translational research. The Monarch App is an integrated platform combining data about genes, phenotypes, and diseases across species. Monarch's APIs enable access to carefully curated datasets and advanced analysis tools that support the understanding and diagnosis of disease for diverse applications such as variant prioritization, deep phenotyping, and patient profile-matching. We have migrated our system into a scalable, cloud-based infrastructure; simplified Monarch's data ingestion and knowledge graph integration systems; enhanced data mapping and integration standards; and developed a new user interface with novel search and graph navigation features. Furthermore, we advanced Monarch's analytic tools by developing a customized plugin for OpenAI's ChatGPT to increase the reliability of its responses about phenotypic data, allowing us to interrogate the knowledge in the Monarch graph using state-of-the-art Large Language Models. The resources of the Monarch Initiative can be found at monarchinitiative.org and its corresponding code repository at github.com/monarch-initiative/monarch-app.

Subjects

Databases, Factual , Disease , Genes , Phenotype , Humans , Internet , Databases, Factual/standards , Software , Genes/genetics , Disease/genetics

5.

On the limitations of large language models in clinical diagnosis.

Reese, Justin T; Danis, Daniel; Caufield, J Harry; Groza, Tudor; Casiraghi, Elena; Valentini, Giorgio; Mungall, Christopher J; Robinson, Peter N.

medRxiv ; 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-37503093

ABSTRACT

Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.

6.

A Knowledge Graph Approach to Elucidate the Role of Organellar Pathways in Disease via Biomedical Reports.

Pelletier, Alexander R; Steinecke, Dylan; Sigdel, Dibakar; Adam, Irsyad; Caufield, J Harry; Guevara-Gonzalez, Vladimir; Ramirez, Joseph; Verma, Aarushi; Bali, Kaitlyn; Downs, Katherine; Wang, Wei; Bui, Alex; Ping, Peipei.

J Vis Exp ; (200), 2023 10 13.

Article in English | MEDLINE | ID: mdl-37902366

ABSTRACT

The rapidly increasing and vast quantities of biomedical reports, each containing numerous entities and rich information, represent a rich resource for biomedical text-mining applications. These tools enable investigators to integrate, conceptualize, and translate these discoveries to uncover new insights into disease pathology and therapeutics. In this protocol, we present CaseOLAP LIFT, a new computational pipeline to investigate cellular components and their disease associations by extracting user-selected information from text datasets (e.g., biomedical literature). The software identifies sub-cellular proteins and their functional partners within disease-relevant documents. Additional disease-relevant documents are identified via the software's label imputation method. To contextualize the resulting protein-disease associations and to integrate information from multiple relevant biomedical resources, a knowledge graph is automatically constructed for further analyses. We present one use case with a corpus of ~34 million text documents downloaded online to provide an example of elucidating the role of mitochondrial proteins in distinct cardiovascular disease phenotypes using this method. Furthermore, a deep learning model was applied to the resulting knowledge graph to predict previously unreported relationships between proteins and disease, resulting in 1,583 associations with predicted probabilities >0.90 and with an area under the receiver operating characteristic curve (AUROC) of 0.91 on the test set. This software features a highly customizable and automated workflow, with a broad scope of raw data available for analysis; therefore, using this method, protein-disease associations can be identified with enhanced reliability within a text corpus.

Subjects

Pattern Recognition, Automated , Software , Reproducibility of Results , Data Mining/methods

7.

Gene Set Summarization using Large Language Models.

Joachimiak, Marcin P; Caufield, J Harry; Harris, Nomi L; Kim, Hyeongsik; Mungall, Christopher J.

ArXiv ; 2023 May 25.

Article in English | MEDLINE | ID: mdl-37292480

ABSTRACT

Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB. We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.

8.

KG-Hub-building and exchanging biological knowledge graphs.

Caufield, J Harry; Putman, Tim; Schaper, Kevin; Unni, Deepak R; Hegde, Harshad; Callahan, Tiffany J; Cappelletti, Luca; Moxon, Sierra A T; Ravanmehr, Vida; Carbon, Seth; Chan, Lauren E; Cortes, Katherina; Shefchek, Kent A; Elsarboukh, Glass; Balhoff, Jim; Fontana, Tommaso; Matentzoglu, Nicolas; Bruskiewich, Richard M; Thessen, Anne E; Harris, Nomi L; Munoz-Torres, Monica C; Haendel, Melissa A; Robinson, Peter N; Joachimiak, Marcin P; Mungall, Christopher J; Reese, Justin T.

Bioinformatics ; 39(7), 2023 07 01.

Article in English | MEDLINE | ID: mdl-37389415

ABSTRACT

MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.

Subjects

Biological Ontologies , COVID-19 , Humans , Pattern Recognition, Automated , Rare Diseases , Machine Learning

9.

Generalisable long COVID subtypes: findings from the NIH N3C and RECOVER programmes.

Reese, Justin T; Blau, Hannah; Casiraghi, Elena; Bergquist, Timothy; Loomba, Johanna J; Callahan, Tiffany J; Laraway, Bryan; Antonescu, Corneliu; Coleman, Ben; Gargano, Michael; Wilkins, Kenneth J; Cappelletti, Luca; Fontana, Tommaso; Ammar, Nariman; Antony, Blessy; Murali, T M; Caufield, J Harry; Karlebach, Guy; McMurry, Julie A; Williams, Andrew; Moffitt, Richard; Banerjee, Jineta; Solomonides, Anthony E; Davis, Hannah; Kostka, Kristin; Valentini, Giorgio; Sahner, David; Chute, Christopher G; Madlock-Brown, Charisse; Haendel, Melissa A; Robinson, Peter N.

EBioMedicine ; 87: 104413, 2023 Jan.

Article in English | MEDLINE | ID: mdl-36563487

ABSTRACT

BACKGROUND: Stratification of patients with post-acute sequelae of SARS-CoV-2 infection (PASC, or long COVID) would allow precision clinical management strategies. However, long COVID is incompletely understood and characterised by a wide range of manifestations that are difficult to analyse computationally. Additionally, the generalisability of machine learning classification of COVID-19 clinical outcomes has rarely been tested. METHODS: We present a method for computationally modelling PASC phenotype data based on electronic healthcare records (EHRs) and for assessing pairwise phenotypic similarity between patients using semantic similarity. Our approach defines a nonlinear similarity function that maps from a feature space of phenotypic abnormalities to a matrix of pairwise patient similarity that can be clustered using unsupervised machine learning. FINDINGS: We found six clusters of PASC patients, each with distinct profiles of phenotypic abnormalities, including clusters with distinct pulmonary, neuropsychiatric, and cardiovascular abnormalities, and a cluster associated with broad, severe manifestations and increased mortality. There was significant association of cluster membership with a range of pre-existing conditions and measures of severity during acute COVID-19. We assigned new patients from other healthcare centres to clusters by maximum semantic similarity to the original patients, and showed that the clusters were generalisable across different hospital systems. The increased mortality rate originally identified in one cluster was consistently observed in patients assigned to that cluster in other hospital systems. INTERPRETATION: Semantic phenotypic clustering provides a foundation for assigning patients to stratified subgroups for natural history or therapy studies on PASC. FUNDING: NIH (TR002306/OT2HL161847-01/OD011883/HG010860), U.S.D.O.E. (DE-AC02-05CH11231), Donald A. Roux Family Fund at Jackson Laboratory, Marsico Family at CU Anschutz.
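
The step in which new patients are assigned to existing clusters by maximum similarity to the original patients can be sketched as follows. This is an illustrative simplification, not the study's method: Jaccard overlap of term sets stands in for the ontology-based semantic similarity the abstract describes, and the function names, cluster labels, and term identifiers are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap of two term sets (a stand-in for semantic similarity)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_cluster(new_terms, reference_patients):
    """Return the cluster label of the most similar reference patient.

    reference_patients: list of (term_set, cluster_label) pairs taken
    from the original, already clustered cohort.
    """
    best_terms, best_label = max(
        reference_patients, key=lambda ref: jaccard(new_terms, ref[0])
    )
    return best_label


if __name__ == "__main__":
    # Illustrative phenotype-term identifiers and cluster names only.
    reference = [
        ({"HP:0002094", "HP:0012735"}, "pulmonary"),
        ({"HP:0000739", "HP:0002315"}, "neuropsychiatric"),
    ]
    print(assign_cluster({"HP:0002094", "HP:0001945"}, reference))
```

A real semantic similarity measure would also credit terms that are related through the ontology hierarchy rather than requiring exact matches, which plain set overlap cannot do.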

Subjects

COVID-19 , Post-Acute COVID-19 Syndrome , Humans , Disease Progression , SARS-CoV-2

10.

Biolink Model: A universal schema for knowledge graphs in clinical, biomedical, and translational science.

Unni, Deepak R; Moxon, Sierra A T; Bada, Michael; Brush, Matthew; Bruskiewich, Richard; Caufield, J Harry; Clemons, Paul A; Dancik, Vlado; Dumontier, Michel; Fecho, Karamarie; Glusman, Gustavo; Hadlock, Jennifer J; Harris, Nomi L; Joshi, Arpita; Putman, Tim; Qin, Guangrong; Ramsey, Stephen A; Shefchek, Kent A; Solbrig, Harold; Soman, Karthik; Thessen, Anne E; Haendel, Melissa A; Bizon, Chris; Mungall, Christopher J.

Clin Transl Sci ; 15(8): 1848-1855, 2022 08.

Article in English | MEDLINE | ID: mdl-36125173

ABSTRACT

Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs has left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.
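
The idea of typed nodes (categories) connected by typed relationships (predicates) can be made concrete with a small Python sketch. The dataclasses, the `is_valid` helper, and in particular the `ALLOWED` table are assumptions written by hand for illustration; they are not taken from the Biolink Model specification, which defines its own, much richer constraints. The identifiers are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str        # a CURIE-style identifier (illustrative values below)
    category: str  # the class the entity belongs to


@dataclass(frozen=True)
class Edge:
    subject: Node
    predicate: str  # the relationship type
    object: Node


# Toy constraint table: predicate -> (subject category, object category).
# Hand-written for this example; NOT copied from the model itself.
ALLOWED = {
    "biolink:gene_associated_with_condition": ("biolink:Gene", "biolink:Disease"),
    "biolink:has_phenotype": ("biolink:Disease", "biolink:PhenotypicFeature"),
}


def is_valid(edge):
    """True if the edge's categories match the toy table for its predicate."""
    expected = ALLOWED.get(edge.predicate)
    return expected == (edge.subject.category, edge.object.category)


if __name__ == "__main__":
    gene = Node("HGNC:1100", "biolink:Gene")
    disease = Node("MONDO:0007254", "biolink:Disease")
    edge = Edge(gene, "biolink:gene_associated_with_condition", disease)
    print(is_valid(edge))
```

Because every source is expressed with the same categories and predicates, edges from different knowledge graphs can be checked and merged with one shared set of rules, which is the interoperability benefit the abstract emphasizes.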

Subjects

Pattern Recognition, Automated , Translational Science, Biomedical , Knowledge

11.

ZapG (YhcB/DUF1043), a novel cell division protein in gamma-proteobacteria linking the Z-ring to septal peptidoglycan synthesis.

Mehla, Jitender; Liechti, George; Morgenstein, Randy M; Caufield, J Harry; Hosseinnia, Ali; Gagarinova, Alla; Phanse, Sadhna; Goodacre, Norman; Brockett, Mary; Sakhawalkar, Neha; Babu, Mohan; Xiao, Rong; Montelione, Gaetano T; Vorobiev, Sergey; den Blaauwen, Tanneke; Hunt, John F; Uetz, Peter.

J Biol Chem ; 296: 100700, 2021.

Article in English | MEDLINE | ID: mdl-33895137

ABSTRACT

YhcB, a poorly understood protein conserved across gamma-proteobacteria, contains a domain of unknown function (DUF1043) and an N-terminal transmembrane domain. Here, we used an integrated approach including X-ray crystallography, genetics, and molecular biology to investigate the function and structure of YhcB. The Escherichia coli yhcB KO strain does not grow at 45 °C and is hypersensitive to cell wall-acting antibiotics, even in the stationary phase. The deletion of yhcB leads to filamentation, abnormal FtsZ ring formation, and aberrant septum development. The Z-ring is essential for the positioning of the septa and the initiation of cell division. We found that YhcB interacts with proteins of the divisome (e.g., FtsI, FtsQ) and elongasome (e.g., RodZ, RodA). Seven of these interactions are also conserved in Yersinia pestis and/or Vibrio cholerae. Furthermore, we mapped the amino acid residues likely involved in the interactions of YhcB with FtsI and RodZ. The 2.8 Å crystal structure of the cytosolic domain of Haemophilus ducreyi YhcB shows a unique tetrameric α-helical coiled-coil structure likely to be involved in linking the Z-ring to the septal peptidoglycan-synthesizing complexes. In summary, YhcB is a conserved and conditionally essential protein that plays a role in cell division and consequently affects envelope biogenesis. Based on these findings, we propose to rename YhcB to ZapG (Z-ring-associated protein G). This study will serve as a starting point for future studies on this protein family and on how cells transit from exponential to stationary survival.

Subjects

Bacterial Proteins/metabolism , Peptidoglycan/biosynthesis , Proteobacteria/cytology , Proteobacteria/metabolism , Bacterial Proteins/chemistry , Cell Division , Crystallography, X-Ray , Models, Molecular , Protein Conformation

12.

A Second Look at FAIR in Proteomic Investigations.

Caufield, J Harry; Fu, John; Wang, Ding; Guevara-Gonzalez, Vladimir; Wang, Wei; Ping, Peipei.

J Proteome Res ; 20(5): 2182-2186, 2021 05 07.

Article in English | MEDLINE | ID: mdl-33719446

ABSTRACT

Proteomics is, by definition, comprehensive and large-scale, seeking to unravel ome-level protein features with phenotypic information on an entire system, an organ, cells, or organisms. This scope consistently involves and extends beyond single experiments. Multitudinous resources now exist to assist in making the results of proteomics experiments more findable, accessible, interoperable, and reusable (FAIR), yet many tools are still awaiting adoption by our community. Here we highlight strategies for expanding the impact of proteomics data beyond single studies. We show how linking specific terminologies, identifiers, and text (words) can unify individual data points across a wide spectrum of studies and, more importantly, how this approach may reveal novel relationships. In this effort, we explain how data sets and methods can be rendered more linkable and how this maximizes their value. We also include a discussion on how data linking strategies benefit stakeholders across the proteomics community and beyond.

Subjects

Proteomics

13.

New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data.

Caufield, J Harry; Ping, Peipei.

Emerg Top Life Sci ; 3(4): 357-369, 2019 Aug 16.

Article in English | MEDLINE | ID: mdl-33523203

ABSTRACT

Protein-protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein-protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.

14.

A reference set of curated biomedical data and metadata from clinical case reports.

Caufield, J Harry; Zhou, Yijiang; Garlid, Anders O; Setty, Shaun P; Liem, David A; Cao, Quan; Lee, Jessica M; Murali, Sanjana; Spendlove, Sarah; Wang, Wei; Zhang, Li; Sun, Yizhou; Bui, Alex; Hermjakob, Henning; Watson, Karol E; Ping, Peipei.

Sci Data ; 5: 180258, 2018 11 20.

Article in English | MEDLINE | ID: mdl-30457569

ABSTRACT

Clinical case reports (CCRs) provide an important means of sharing clinical experiences about atypical disease phenotypes and new therapies. However, published case reports contain largely unstructured and heterogeneous clinical data, posing a challenge to mining relevant information. Current indexing approaches generally concern document-level features and have not been specifically designed for CCRs. To address this disparity, we developed a standardized metadata template and identified text corresponding to medical concepts within 3,100 curated CCRs spanning 15 disease groups and more than 750 reports of rare diseases. We also prepared a subset of metadata on reports on selected mitochondrial diseases and assigned ICD-10 diagnostic codes to each. The resulting resource, Metadata Acquired from Clinical Case Reports (MACCRs), contains text associated with high-level clinical concepts, including demographics, disease presentation, treatments, and outcomes for each report. Our template and MACCR set render CCRs more findable, accessible, interoperable, and reusable (FAIR) while serving as valuable resources for key user groups, including researchers, physician investigators, clinicians, data scientists, and those shaping government policies for clinical trials.

Subjects

Clinical Studies as Topic , Data Curation , Metadata , Computational Biology , Data Analysis , Data Curation/methods , Data Curation/standards , Humans , Metadata/standards

15.

Making the Right Choice: Critical Parameters of the Y2H Systems.

Mehla, Jitender; Caufield, J Harry; Uetz, Peter.

Methods Mol Biol ; 1794: 17-28, 2018.

Article in English | MEDLINE | ID: mdl-29855948

ABSTRACT

Two-hybrid methods remain among the most preferred choices for detecting protein-protein interactions (PPIs), and much of the PPI data in databases has been produced using yeast two-hybrid (Y2H) screens. The Y2H methods are extensively used to detect PPIs because of their scalability and accessibility. Several variants of Y2H methods have been developed and used by different research groups, increasing the accessibility of these methods and their applications in detecting different types of PPIs. However, the availability of variations on the same core methodology emphasizes the need for a systematic comparison of available Y2H methods in the context of their applicability, coverage and efficiency. In this chapter, we discuss the key parameters of Y2H methods, namely proteins of interest, vectors, libraries, screening strategies, and data analysis, and provide a flowchart that should help to decide which Y2H strategy is most appropriate for a protein interaction screen.

Subjects

High-Throughput Screening Assays/methods , Protein Interaction Mapping/methods , Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Two-Hybrid System Techniques , Humans , Protein Binding

16.

Proteome Data Improves Protein Function Prediction in the Interactome of Helicobacter pylori.

Wuchty, Stefan; Müller, Stefan A; Caufield, J Harry; Häuser, Roman; Aloy, Patrick; Kalkhof, Stefan; Uetz, Peter.

Mol Cell Proteomics ; 17(5): 961-973, 2018 05.

Article in English | MEDLINE | ID: mdl-29414760

ABSTRACT

Helicobacter pylori is a common pathogen that is estimated to infect half of the human population, causing several diseases such as duodenal ulcer. Despite being one of the first pathogens to be sequenced, its proteome remains poorly characterized, as about one-third of its proteins have no functional annotation. Here, we integrate and analyze known protein interactions with proteomic and genomic data from different sources. We find that proteins with similar abundances tend to interact. This observation is accompanied by a tendency for interactions to occur between proteins of similar functions, although some show marked cross-talk to others. Protein function prediction with protein interactions is significantly improved when interactions from other bacteria are included in our network, allowing us to obtain putative functions of more than 300 poorly or previously uncharacterized proteins. Proteins that are critical for the topological controllability of the underlying network are significantly enriched with genes that are up-regulated in the spiral compared with the coccoid form of H. pylori. Determining their evolutionary conservation, we present evidence that 80 protein complexes are identical in composition with their counterparts in Escherichia coli, while 85 are partially conserved and 120 complexes are completely absent. Furthermore, we determine network clusters that coincide with related functions, gene essentiality, genetic context, cellular localization, and gene expression in different cellular states.

Subjects

Bacterial Proteins/metabolism , Helicobacter pylori/metabolism , Protein Interaction Maps , Proteome/metabolism , Proteomics/methods , Gene Expression Regulation , Bacterial Genome , Helicobacter pylori/genetics , Molecular Models , Multiprotein Complexes/metabolism , Operon/genetics , Phenotype

17.

Global landscape of cell envelope protein complexes in Escherichia coli.

Babu, Mohan; Bundalovic-Torma, Cedoljub; Calmettes, Charles; Phanse, Sadhna; Zhang, Qingzhou; Jiang, Yue; Minic, Zoran; Kim, Sunyoung; Mehla, Jitender; Gagarinova, Alla; Rodionova, Irina; Kumar, Ashwani; Guo, Hongbo; Kagan, Olga; Pogoutse, Oxana; Aoki, Hiroyuki; Deineko, Viktor; Caufield, J Harry; Holtzapple, Erik; Zhang, Zhongge; Vastermark, Ake; Pandya, Yogee; Lai, Christine Chieh-Lin; El Bakkouri, Majida; Hooda, Yogesh; Shah, Megha; Burnside, Dan; Hooshyar, Mohsen; Vlasblom, James; Rajagopala, Sessandra V; Golshani, Ashkan; Wuchty, Stefan; F Greenblatt, Jack; Saier, Milton; Uetz, Peter; F Moraes, Trevor; Parkinson, John; Emili, Andrew.

Nat Biotechnol ; 36(1): 103-112, 2018 01.

Article in English | MEDLINE | ID: mdl-29176613

ABSTRACT

Bacterial cell envelope protein (CEP) complexes mediate a range of processes, including membrane assembly, antibiotic resistance and metabolic coordination. However, only limited characterization of relevant macromolecules has been reported to date. Here we present a proteomic survey of 1,347 CEPs encompassing 90% of the inner- and outer-membrane and periplasmic proteins of Escherichia coli. After extraction with non-denaturing detergents, we affinity-purified 785 endogenously tagged CEPs and identified stably associated polypeptides by precision mass spectrometry. The resulting high-quality physical interaction network, comprising 77% of targeted CEPs, revealed many previously uncharacterized heteromeric complexes. We found that the secretion of autotransporters requires translocation and the assembly module TamB to nucleate proper folding from periplasm to cell surface through a cooperative mechanism involving the β-barrel assembly machinery. We also establish that an ABC transporter of unknown function, YadH, together with the Mla system preserves outer membrane lipid asymmetry. This E. coli CEP 'interactome' provides insights into the functional landscape governing CE systems essential to bacterial growth, metabolism and drug resistance.

Subjects

Cell Membrane/genetics , Escherichia coli/genetics , Multiprotein Complexes/genetics , Proteomics , Cell Membrane/chemistry , Membrane Proteins/chemistry , Membrane Proteins/classification , Membrane Proteins/genetics , Multiprotein Complexes/chemistry , Multiprotein Complexes/classification

18.

Virus-host protein-protein interactions of mycobacteriophage Giles.

Mehla, Jitender; Dedrick, Rebekah M; Caufield, J Harry; Wagemans, Jeroen; Sakhawalkar, Neha; Johnson, Allison; Hatfull, Graham F; Uetz, Peter.

Sci Rep ; 7(1): 16514, 2017 11 28.

Article in English | MEDLINE | ID: mdl-29184079

ABSTRACT

Mycobacteriophages are viruses that infect mycobacteria. More than 1,400 mycobacteriophage genomes have been sequenced, coding for over one hundred thousand proteins of unknown function. Here we investigate mycobacteriophage Giles-host protein-protein interactions (PPIs) using yeast two-hybrid (Y2H) screening. A total of 25 reproducible PPIs were found for a selected set of 10 Giles proteins, including a putative virion assembly protein (gp17), the phage integrase (gp29), the endolysin (gp31), the phage repressor (gp47), and six proteins of unknown function (gp34, gp35, gp54, gp56, gp64, and gp65). We note that overexpression of the proteins is toxic to M. smegmatis, although it is unclear whether this toxicity and the associated changes in cellular morphology are related to the putative interactions revealed in the Y2H screen.

Subjects

Bacterial Proteins/metabolism , Host-Pathogen Interactions , Mycobacteriophages/physiology , Mycobacterium/metabolism , Mycobacterium/virology , Protein Interaction Mapping , Viral Proteins/metabolism , Viral Gene Expression Regulation , Phenotype , Protein Interaction Maps , Two-Hybrid System Techniques , Viral Proteins/genetics

19.

Bacterial protein meta-interactomes predict cross-species interactions and protein function.

Caufield, J Harry; Wimble, Christopher; Shary, Semarjit; Wuchty, Stefan; Uetz, Peter.

BMC Bioinformatics ; 18(1): 171, 2017 Mar 16.

Article in English | MEDLINE | ID: mdl-28298180

ABSTRACT

BACKGROUND: Protein-protein interactions (PPIs) can offer compelling evidence for protein function, especially when viewed in the context of proteome-wide interactomes. Bacteria have been popular subjects of interactome studies: more than six different bacterial species have been the subjects of comprehensive interactome studies while several more have had substantial segments of their proteomes screened for interactions. The protein interactomes of several bacterial species have been completed, including several from prominent human pathogens. The availability of interactome data has brought challenges, as these large data sets are difficult to compare across species, limiting their usefulness for broad studies of microbial genetics and evolution. RESULTS: In this study, we use more than 52,000 unique protein-protein interactions (PPIs) across 349 different bacterial species and strains to determine their conservation across data sets and taxonomic groups. When proteins are collapsed into orthologous groups (OGs) the resulting meta-interactome still includes more than 43,000 interactions, about 14,000 of which involve proteins of unknown function. While conserved interactions provide support for protein function in their respective species data, we found only 429 PPIs (~1% of the available data) conserved in two or more species, rendering any cross-species interactome comparison immediately useful. The meta-interactome serves as a model for predicting interactions, protein functions, and even full interactome sizes for species with limited to no experimentally observed PPIs, including Bacillus subtilis and Salmonella enterica, which are predicted to have up to 18,000 and 31,000 PPIs, respectively. CONCLUSIONS: In the course of this work, we have assembled cross-species interactome comparisons that will allow interactomics researchers to anticipate the structures of yet-unexplored microbial interactomes and to focus on well-conserved yet uncharacterized interactors for further study. Such conserved interactions should provide evidence for important but yet-uncharacterized aspects of bacterial physiology and may provide targets for anti-microbial therapies.

Subjects

Bacteria/metabolism , Bacterial Proteins/metabolism , Protein Interaction Mapping/methods , Bacillus subtilis/metabolism , Bacterial Proteins/chemistry , Molecular Evolution , Humans , Proteome/metabolism , Salmonella enterica/metabolism

20.

The Protein Interactome of Mycobacteriophage Giles Predicts Functions for Unknown Proteins.

Mehla, Jitender; Dedrick, Rebekah M; Caufield, J Harry; Siefring, Rachel; Mair, Megan; Johnson, Allison; Hatfull, Graham F; Uetz, Peter.

J Bacteriol ; 197(15): 2508-16, 2015 Aug 01.

Article in English | MEDLINE | ID: mdl-25986902

ABSTRACT

UNLABELLED: Mycobacteriophages are viruses that infect mycobacterial hosts and are prevalent in the environment. Nearly 700 mycobacteriophage genomes have been completely sequenced, revealing considerable diversity and genetic novelty. Here, we have determined the protein complement of mycobacteriophage Giles by mass spectrometry and mapped its genome-wide protein interactome to help elucidate the roles of its 77 predicted proteins, 50% of which have no known function. About 22,000 individual yeast two-hybrid (Y2H) tests with four different Y2H vectors, followed by filtering and retest screens, resulted in 324 reproducible protein-protein interactions, including 171 (136 nonredundant) high-confidence interactions. The complete set of high-confidence interactions among Giles proteins reveals new mechanistic details and predicts functions for unknown proteins. The Giles interactome is the first for any mycobacteriophage and one of just five known phage interactomes so far. Our results will help in understanding mycobacteriophage biology and aid in development of new genetic and therapeutic tools to understand Mycobacterium tuberculosis. IMPORTANCE: Mycobacterium tuberculosis causes over 9 million new cases of tuberculosis each year. Mycobacteriophages, viruses of mycobacterial hosts, hold considerable potential to understand phage diversity, evolution, and mycobacterial biology, aiding in the development of therapeutic tools to control mycobacterial infections. The mycobacteriophage Giles protein-protein interaction network allows us to predict functions for unknown proteins and shed light on major biological processes in phage biology. For example, Giles gp76, a protein of unknown function, is found to associate with phage packaging and maturation. The functions of mycobacteriophage-derived proteins may suggest novel therapeutic approaches for tuberculosis. Our ORFeome clone set of Giles proteins and the interactome data will be useful resources for phage interactomics.

Subjects

Viral Gene Expression Regulation/physiology , Mycobacteriophages/metabolism , Mycobacterium smegmatis/virology , Protein Interaction Domains and Motifs/physiology , Viral Proteins/metabolism , Computational Biology , Mass Spectrometry , Mycobacteriophages/genetics , Mycobacterium tuberculosis/virology , Protein Interaction Maps , Two-Hybrid System Techniques , Viral Proteins/genetics
